Useful third-party libraries: exercises

Biopython

Can you count the number of sequences in the data/proteome.faa file?


In [1]:
from Bio import SeqIO

counter = 0

for seq in SeqIO.parse('../data/proteome.faa', 'fasta'):
    counter += 1
    
counter


Out[1]:
4306
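
As a cross-check that does not need Biopython: every FASTA record begins with a header line starting with `>`, so counting those lines should give the same number. The sketch below runs on a small made-up FASTA string. The `example_fasta` records and the `count_fasta_records` helper are illustrative assumptions, not part of the exercise data; on the real file you would pass an open file handle instead.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in handle if line.startswith('>'))


# Hypothetical three-record FASTA content, for illustration only
example_fasta = """>prot1 first example protein
MKTAYIAKQR
QISFVKSHFS
>prot2 second example protein
MSDNE
>prot3 third example protein
MAAAK
"""

n_records = count_fasta_records(io.StringIO(example_fasta))
print(n_records)
```

If the two counts disagree on a real file, the file probably contains a stray `>` at the start of a non-header line or is not plain FASTA.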

Can you plot the distribution of protein sizes in the data/proteome.faa file?


In [2]:
%matplotlib inline

import matplotlib.pyplot as plt

In [8]:
# Collect the length (number of residues) of every protein
sizes = [len(seq) for seq in SeqIO.parse('../data/proteome.faa', 'fasta')]

plt.hist(sizes, bins=100)
plt.xlabel('protein size (amino acids)')
plt.ylabel('count');


Can you count the number of CDS sequences in the data/ecoli.gbk file?


In [9]:
counter = 0

for seq in SeqIO.parse('../data/ecoli.gbk', 'genbank'):
    for feat in seq.features:
        if feat.type == 'CDS':
            counter += 1

counter


Out[9]:
4319
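
A rough way to sanity-check this number without Biopython is to count the feature-table lines whose key is `CDS`. In the GenBank flat-file layout, feature keys are indented by five spaces, while qualifier lines are indented much further. The sketch below assumes that standard layout and runs on a tiny hand-written record; `example_genbank` and `count_cds_lines` are hypothetical names used only for illustration.

```python
import io


def count_cds_lines(handle):
    """Count feature-table lines whose feature key is exactly 'CDS'."""
    count = 0
    for line in handle:
        # Feature keys follow exactly five spaces; qualifier lines
        # (e.g. /product="...") are indented by 21 spaces and never match.
        if line.startswith('     CDS '):
            count += 1
    return count


# Hypothetical minimal GenBank-style record, for illustration only
example_genbank = """LOCUS       EXAMPLE                   60 bp    DNA     linear   BCT 01-JAN-2000
FEATURES             Location/Qualifiers
     source          1..60
     gene            1..30
                     /gene="exA"
     CDS             1..30
                     /gene="exA"
                     /product="CDS-like example protein"
     gene            31..60
     CDS             complement(31..60)
                     /gene="exB"
ORIGIN
        1 atgaaaaccg cgtatattgc gaaacagcgc tgattagcgc tgtttcgcaa tatacgcggt
//
"""

n_cds = count_cds_lines(io.StringIO(example_genbank))
print(n_cds)
```

This is only a plausibility check: parsing with `SeqIO` remains the more robust approach, since it also gives access to locations and qualifiers.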

Can you compute the average root-to-tip distance in the data/tree.nwk file?


In [11]:
from Bio import Phylo

tree = Phylo.read('../data/tree.nwk', 'newick')

# Root-to-tip distance: sum of branch lengths from the root to each terminal node
distances = [tree.distance(tree.root, node) for node in tree.get_terminals()]

sum(distances) / len(distances)


Out[11]:
0.4553809170833998
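
To make explicit what `tree.distance(tree.root, node)` adds up, here is a minimal pure-Python sketch on a hand-built toy tree: the root-to-tip distance of a leaf is the sum of the branch lengths on the path from the root to that leaf. The nested-tuple tree layout, the `toy_tree` values and the `root_to_tip_distances` helper are assumptions for illustration, not Biopython's data model.

```python
def root_to_tip_distances(node, distance_so_far=0.0):
    """Yield (leaf_name, distance_from_root) for every leaf below `node`.

    A node is a tuple (name, branch_length, children); leaves have no children.
    """
    name, branch_length, children = node
    total = distance_so_far + branch_length
    if not children:
        yield name, total
    for child in children:
        yield from root_to_tip_distances(child, total)


# Toy tree equivalent to the Newick string ((A:0.25,B:0.75):0.5,C:1.0);
toy_tree = ('root', 0.0, [
    ('AB', 0.5, [('A', 0.25, []), ('B', 0.75, [])]),
    ('C', 1.0, []),
])

toy_distances = dict(root_to_tip_distances(toy_tree))
toy_average = sum(toy_distances.values()) / len(toy_distances)
print(toy_distances)
print(toy_average)
```

Here the leaves A, B and C sit at distances 0.75, 1.25 and 1.0 from the root, so the average is 1.0.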

NetworkX

Can you read the yeast protein interaction network in data/yeast.gml? Can you plot the degree distribution of the proteins contained in the graph?


In [15]:
import networkx as nx

In [16]:
graph = nx.read_gml('../data/yeast.gml')

In [23]:
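# In NetworkX 2.x and later, degree() returns a view of (node, degree) pairs
# rather than a dict, so convert it before taking the values
degrees = list(dict(graph.degree()).values())
plt.hist(degrees, bins=20)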
plt.xlabel('degree')
plt.ylabel('count');
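
For readers who want to see what the histogram is counting, the sketch below computes a degree distribution by hand with the standard library: the degree of a node is the number of edges touching it, and the distribution records how many nodes have each degree. The `edges` list is a made-up undirected graph, not the yeast network.

```python
from collections import Counter

# Hypothetical undirected edge list, for illustration only
edges = [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C')]

# Each edge contributes one to the degree of both of its endpoints
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Map each degree value to the number of nodes that have it
degree_distribution = Counter(degree.values())

for k in sorted(degree_distribution):
    print(k, degree_distribution[k])
```

In this toy graph one node has degree 1, two have degree 2 and one has degree 3; `plt.hist` bins the same per-node degree values for the real network.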